Abstract

Syntactic features play an essential role in identifying relationships in a
sentence. Previous neural network models often suffer from irrelevant
information introduced when subjects and objects are far apart. In this paper,
we propose to learn more robust relation representations from the shortest
dependency path through a convolutional neural network. We further propose a
straightforward negative sampling strategy to improve the assignment of
subjects and objects. Experimental results show that our method outperforms
the state-of-the-art methods on the SemEval-2010 Task 8 dataset.


1 Introduction

The relation extraction (RE) task can be defined as follows: given a sentence
S with a pair of nominals e1 and e2, we aim to identify the relationship
between e1 and e2. RE is typically investigated in a classification style,
where many features have been proposed, e.g., Hendrickx et al. (2010) designed
16 types of features including POS, WordNet, FrameNet, dependency parse
features, etc. Among them, syntactic features are considered to bring
significant improvements in extraction accuracy (Bunescu and Mooney, 2005a).
Earlier attempts to encode syntactic information are mainly kernel-based
methods, such as the convolution tree kernel (Qian et al., 2008), subsequence
kernel (Bunescu and Mooney, 2005b), and dependency tree kernel (Bunescu and
Mooney, 2005a).

With the recent success of neural networks in NLP, different neural network
models have been proposed to learn syntactic features from raw sequences of
words or constituent parse trees (Zeng et al., 2014; Socher et al., 2012).
These models have proved effective, but often suffer from irrelevant
subsequences or clauses, especially when subjects and objects are far apart.
For example, in the sentence "The [singer]e1, who performed three of the
nominated songs, also caused a [commotion]e2 on the red carpet", the who
clause modifies the subject e1 but is unrelated to the Cause-Effect
relationship between singer and commotion. Incorporating such information into
the model will hurt the extraction performance. We therefore propose to learn
a more robust relation representation from a convolutional neural network
model that works on the simple dependency path between subjects and objects,
which naturally characterizes the relationship between two nominals and avoids
negative effects from other irrelevant chunks or clauses.


Our second contribution is the introduction of a negative sampling strategy
into the CNN models to address relation directionality, i.e., properly
assigning the subject and object within a relationship. In the above singer
example, (singer, commotion) holds the Cause-Effect relation, while
(commotion, singer) does not. Previous works do not fully investigate the
differences between subjects and objects in the utterance, and simply
transform a (K+1)-relation task into a (2×K+1)-class classification task,
where 1 is the other relation. Interestingly, we find that dependency paths
naturally offer the relative positions of subjects and objects through the
path directions. In this paper, we propose to model the relation
directionality by exploiting the dependency path to learn the assignments of
subjects and objects using a straightforward negative sampling method, which
adopts the shortest dependency path from the object to the subject as a
negative sample. Experimental results show that the negative sampling method
significantly improves the performance, and our model outperforms the
state-of-the-art methods on the SemEval-2010 Task 8 dataset.



















Sentence: The recipient receives the call through a miniature radio receiver carried on his person.
Relation: Instrument-Agency(receiver, recipient)
Shortest dependency path: receiver ← carried → radio ← through ← receives → recipient


Figure 1: The shortest dependency path representation for an example sentence from SemEval-08. 


2 The Shortest Path Hypothesis 

If e1 and e2 are two nominals mentioned in the same sentence, we assume that
the shortest path between e1 and e2 describes their relationship. This is
because (1) if e1 and e2 are arguments of the same predicate, then their
shortest path should pass through that predicate; (2) if e1 and e2 belong to
different predicate-argument structures, their shortest path will pass through
a sequence of predicates, and any consecutive predicates will share a common
argument. Note that the order of the predicates on the path indicates the
proper assignments of subjects and objects for that relation. For example, in
Figure 1, the dependency path consecutively passes through carried and
receives, which together imply that in the Instrument-Agency relation, the
subject and object play a sender and receiver role, respectively.
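To make the hypothesis concrete, the following minimal sketch extracts a shortest dependency path with a breadth-first search. It is an illustration under stated assumptions, not the authors' implementation: the function name `shortest_dependency_path`, the flat output format (word, arrow, label, word, ...), and the hand-written toy parse with its `nsubj`, `dobj` and `relcl` labels are all hypothetical.

```python
from collections import deque


def shortest_dependency_path(edges, source, target):
    """Return the shortest path between two tokens of a dependency tree.

    `edges` is a list of (head, dependent, label) triples. The tree is
    searched as an undirected graph, but each step records whether it
    follows an edge from head to dependent ("->") or against it ("<-").
    The result is a flat list: word, arrow, label, word, arrow, label, ...
    """
    neighbours = {}
    for head, dep, label in edges:
        neighbours.setdefault(head, []).append((dep, "->", label))
        neighbours.setdefault(dep, []).append((head, "<-", label))

    # Breadth-first search; `parent` remembers how each token was reached.
    parent = {source: None}
    queue = deque([source])
    while queue:
        token = queue.popleft()
        if token == target:
            break
        for nxt, arrow, label in neighbours.get(token, []):
            if nxt not in parent:
                parent[nxt] = (token, arrow, label)
                queue.append(nxt)
    if target not in parent:
        return None  # the two tokens are not connected

    # Walk back from the target to rebuild the path in forward order.
    path = [target]
    token = target
    while parent[token] is not None:
        prev, arrow, label = parent[token]
        path = [prev, arrow, label] + path
        token = prev
    return path


# Hand-written toy parse of the "singer ... commotion" sentence
# (labels are illustrative assumptions, not real parser output).
toy_edges = [
    ("caused", "singer", "nsubj"),
    ("caused", "commotion", "dobj"),
    ("singer", "performed", "relcl"),
]
forward = shortest_dependency_path(toy_edges, "singer", "commotion")
backward = shortest_dependency_path(toy_edges, "commotion", "singer")
```

In this toy parse the path from singer to commotion passes only through the shared predicate caused, so the relative clause headed by performed never enters the representation.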
Figure 2: Architecture of the convolutional neural network.


3 A Convolutional Neural Network Model

Our model takes the shortest dependency path (i.e., the words, dependency edge
directions, and dependency labels) from the subject to the object as input,
passes it through the lookup table layer, produces local features around each
node on the dependency path, and combines these features into a global feature
vector that is then fed to a softmax classifier. Each dimension of the output
vector indicates the confidence score of the corresponding relation.

In the lookup table step, each node (i.e., word, label or arrow) in the
dependency path is transformed into a vector by looking up the embedding
matrix $W_e \in \mathbb{R}^{d \times |V|}$, where $d$ is the dimension of a
vector and $V$ is the set of all nodes we consider.

Convolution To capture the local features around each node of the dependency
path, we consider a fixed-size window of nodes around each node in the window
processing component, producing a matrix of node features of fixed size
$d_w \times 1$, where $d_w = d \times w$ and $w$ is the window size. This
matrix can be built by concatenating the vectors of nodes within the window.

In the convolutional layer, we use a linear transformation
$W_1 \in \mathbb{R}^{n_1 \times d_w}$ to extract local features around each
window of the given sequence, where $n_1$ is the size of hidden layer 1. The
resulting matrix $Z$ has size $n_1 \times t$, where $t$ is the number of nodes
in the input dependency path.

We can see that $Z$ captures local contextual information in the dependency
path. Therefore, we perform max pooling over $Z$ to produce a global feature
vector that captures the most useful local features produced by the
convolutional layer (Collobert et al., 2011). This vector has a fixed size of
$n_1$, independent of the dependency path length.
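The lookup, window, convolution and pooling steps can be sketched in plain Python as follows. This is a minimal illustration under assumptions not stated in the text: the window is centred on each node, the borders are zero-padded, no bias term is used, and all weights are random placeholders. The helper names are hypothetical.

```python
import random


def lookup(path, embeddings):
    """Lookup table step: map every node on the path to its d-dim vector."""
    return [embeddings[node] for node in path]


def windows(vectors, w, d):
    """Concatenate the w vectors centred on each node (zero-padded at the
    borders, which is an assumption), giving one (d*w)-dim vector per node."""
    half = w // 2
    pad = [[0.0] * d for _ in range(half)]
    padded = pad + vectors + pad
    return [sum(padded[i:i + w], []) for i in range(len(vectors))]


def convolve(window_vectors, W1):
    """Apply the same linear map W1 (n1 x d*w) to every window.
    Returns Z as n1 rows, each holding one value per path node (n1 x t)."""
    return [[sum(a * b for a, b in zip(row, x)) for x in window_vectors]
            for row in W1]


def max_pool(Z):
    """Keep the largest activation of each feature over the whole path."""
    return [max(row) for row in Z]


# Toy dimensions and random placeholder parameters (assumptions).
rng = random.Random(0)
d, w, n1 = 4, 3, 5
path = ["singer", "<-", "nsubj", "caused", "->", "dobj", "commotion"]
embeddings = {node: [rng.uniform(-1, 1) for _ in range(d)]
              for node in sorted(set(path))}
W1 = [[rng.uniform(-1, 1) for _ in range(d * w)] for _ in range(n1)]

Z = convolve(windows(lookup(path, embeddings), w, d), W1)
pooled = max_pool(Z)
```

Because pooling takes a maximum over the path dimension, `pooled` always has `n1` entries, whatever the length of the input path.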

Train Strategy    Test Strategy    F1 (%)
Blind             Blind            79.3
Sighted           Blind            81.3
Sighted           Sighted          89.2

Table 1: Performance on the development set with different training and test
strategies.

Dependency based Relation Representation To extract more meaningful features,
we choose the hyperbolic tangent (tanh) as the non-linearity function in the
second hidden layer, which has the advantage of being slightly cheaper to
compute, while leaving the generalization performance unchanged.
$W_2 \in \mathbb{R}^{n_2 \times n_1}$ is the linear transformation matrix,
where $n_2$ is the size of hidden layer 2. The output vector can be considered
as higher-level syntactic features, which are then fed to a softmax
classifier.
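As a rough sketch of this step, the pooled vector can be passed through the tanh layer and then scored by a softmax classifier. The function names, the toy weights, and the absence of bias terms are assumptions made for illustration. `W3` stands for the classifier's weight matrix, with one row per relation type.

```python
import math


def hidden_layer(pooled, W2):
    """Second hidden layer: tanh(W2 . pooled), with W2 of shape n2 x n1."""
    return [math.tanh(sum(a * b for a, b in zip(row, pooled))) for row in W2]


def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(pooled, W2, W3):
    """Map the pooled feature vector to a distribution over relations."""
    features = hidden_layer(pooled, W2)
    scores = [sum(a * b for a, b in zip(row, features)) for row in W3]
    return softmax(scores)


# Toy example with n1 = 3, n2 = 2 and 3 relation types (made-up numbers).
toy_pooled = [0.5, -1.0, 2.0]
toy_W2 = [[0.1, 0.2, 0.3], [-0.3, 0.2, 0.1]]
toy_W3 = [[1.0, -1.0], [0.5, 0.5], [-1.0, 1.0]]
probabilities = classify(toy_pooled, toy_W2, toy_W3)
```

Each entry of `probabilities` plays the role of the confidence score for one relation type.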

Objective Function and Learning The softmax classifier is used to predict a
$K$-class distribution $d(x)$, where $K$ is the number of all possible
relation types, and the transformation matrix is
$W_3 \in \mathbb{R}^{K \times n_2}$. We denote
$t(x) \in \mathbb{R}^{K \times 1}$ as the target distribution vector¹: the
entry $t_k(x)$ is the probability that the dependency path describes the
$k$-th relation. We compute the cross entropy error between $t(x)$ and $d(x)$,
and further define the objective function over all training data:

$$J(\theta) = -\sum_{x} \sum_{k=1}^{K} t_k(x) \log d_k(x) + \lambda \lVert \theta \rVert^2$$

where $\theta = (W_e, W_1, W_2, W_3)$ is the set of model parameters to be
learned, and $\lambda$ is a vector of regularization parameters. The gradients
with respect to the model parameters $\theta$ can be efficiently computed via
backpropagation through network structures. To minimize $J(\theta)$, we apply
stochastic gradient descent (SGD) with AdaGrad (Duchi et al., 2011) in our
experiments².

¹ Note that there may be more than one relation existing between two nominals.
A dependency path thus may correspond to multiple relations.

² We omit detailed formulas due to space limitations.


4 Negative Sampling

We start by presenting three pilot experiments on the development set. In the
first one, we assume that the assignment of the subject and object for a
relation is not given (blind); we simply extract features from e1 to e2, and
test in a blind setting as well. In the second one, we assume that the
assignment is given (sighted) during training, but still blind in the test
phase. In the last one, the assignment is given during both the training and
test steps. The results are listed in Table 1.

The third experiment can be seen as an upper bound, where we do not need to
worry about the assignments of subjects and objects. By comparing the first
and the second one, we can see that when adding assignment information during
training, our model can be significantly improved, indicating that our
dependency based representation can be used to learn the assignments of
subjects/objects, and that injecting a better understanding of such
assignments during training is crucial to the performance. We admit that
models with more complex structures can better handle these considerations.
However, we find that this can be achieved by simply feeding typical negative
samples to the model and letting the model learn from such negative examples
to choose the right assignments of subjects and objects. In practice, we can
treat the opposite assignments of subjects and objects as negative examples.
Note that the dependency path of the wrong assignment is different from that
of the correct assignment, which essentially offers the information for the
model to learn to distinguish the subject and the object.


5 Experimental Evaluation

We evaluate our model on the SemEval-2010 Task 8 dataset (Hendrickx et al.,
2010), which contains 10,717 annotated examples, including 8,000 instances for
training and 2,717 for test. We randomly sampled 2,182 samples from the
training data for validation.

Given a sentence, we first find the shortest dependency path connecting the
two marked nominals, resulting in two dependency paths corresponding to the
two opposite subject/object directions, and then make predictions for the two
paths, respectively. We choose the relation other if and only if both
predictions are other. For the remaining cases, we choose the non-other
relation with the highest confidence as the output, since ideally, for a
non-other instance, our model will output the correct label for the right
subject/object direction and an other label for the wrong direction. We
evaluate our models by macro-averaged F1 using the official evaluation script.
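The two uses of the reversed path, as a negative training example and as the second candidate at prediction time, can be sketched as follows. This is an illustrative reading of the procedure, not the authors' code: the flat path format (word, arrow, label, word, ...), the function names, the `"Other"` label string, and the `classify` callback returning a dictionary of confidences are all assumptions.

```python
FLIP = {"->": "<-", "<-": "->"}


def reverse_path(path):
    """Turn a subject-to-object path into the object-to-subject path.

    `path` is a flat list: word, arrow, label, word, arrow, label, ...
    The node order is reversed and every arrow is flipped.
    """
    words, arrows, labels = path[0::3], path[1::3], path[2::3]
    reversed_path = [words[-1]]
    for i in range(len(arrows) - 1, -1, -1):
        reversed_path += [FLIP[arrows[i]], labels[i], words[i]]
    return reversed_path


def training_examples(instances):
    """Add one negative example (reversed path, labelled Other) for every
    non-Other training instance."""
    examples = []
    for path, relation in instances:
        examples.append((path, relation))
        if relation != "Other":
            examples.append((reverse_path(path), "Other"))
    return examples


def predict_relation(path_e1_to_e2, path_e2_to_e1, classify):
    """Combine the predictions for both directions.

    `classify(path)` is a hypothetical callback returning a dict that maps
    each relation name to a confidence. Other is returned only if both
    directions predict Other; otherwise the most confident non-Other
    prediction and its direction are returned.
    """
    candidates = []
    for direction, path in (("(e1,e2)", path_e1_to_e2),
                            ("(e2,e1)", path_e2_to_e1)):
        scores = classify(path)
        label = max(scores, key=scores.get)
        if label != "Other":
            candidates.append((scores[label], label, direction))
    if not candidates:
        return ("Other", None)
    _, label, direction = max(candidates)
    return (label, direction)
```

With this construction, a trained model that behaves ideally assigns the true relation to the correctly ordered path and Other to the reversed one, so the combination step recovers both the relation and its direction.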

We initialized $W_e$ with 50-dimensional word vectors trained by Turian et al.
(2010). We tuned the hyperparameters using the development set for each
experimental setting. The hyperparameters include $w$, $n_1$, $n_2$, and the
regularization parameters for $W_e$, $W_1$, $W_2$ and $W_3$. The best setting
was obtained with the values: 3, 200, 100, 10^[…], 10^[…], 10^[…] and
2 × 10^[…], respectively.
Method                     Feature Sets                        F1
SVM                        16 types of features                82.2
RNN                        -                                   74.8
                           +POS, NER, WordNet                  77.6
MVRNN                      -                                   79.1
                           +POS, NER, WordNet                  82.4
CNN (Zeng et al., 2014)    -                                   78.9
                           +WordNet, words around nominals     82.7
depCNN                     -                                   81.3
depLCNN                    -                                   81.9
depLCNN                    +WordNet, words around nominals     83.7
depLCNN+NS                 -                                   84.0
                           +WordNet, words around nominals     85.6

Table 2: Comparison of our models with other methods on the SemEval-2010
Task 8.


Negative sampling schemes                          F1
No negative examples                               81.3
Randomly sampled negative examples from NYT        83.5
Dependency paths from the object to subject        85.4

Table 3: Comparison of different negative sampling methods on the development
set.


Results and Discussion Table 2 summarizes the performances of our model,
depLCNN+NS(+), and state-of-the-art models: SVM (Hendrickx et al., 2010), RNN,
MV-RNN (Socher et al., 2012), and CNN (Zeng et al., 2014). For fair
comparison, we also add two types of lexical features, WordNet hypernyms and
words around nominals, as part of the input vector to the final softmax layer.

We can see that our vanilla depLCNN+NS, without extra lexical features, still
outperforms, by a large margin, the previously reported best systems, MVRNN+
and CNN+, both of which take extra lexical features into account, showing that
our treatment of dependency paths can learn a robust and effective relation
representation. When augmented with similar lexical features, our depLCNN+NS
further improves by 1.6%, significantly better than any other system.

Let us first compare the plain versions of depLCNN (taking both dependency
directions and labels into account), depCNN (considering the directions of
dependency edges only), MVRNN and CNN, which all work in a 2×K+1 fashion. We
can see that both depCNN and depLCNN outperform MVRNN and CNN by at least
2.2%, indicating that our treatment is better than previous conventions in
capturing syntactic structures for relation extraction. Note also that
depLCNN, with extra consideration of dependency labels, performs even better
than depCNN, showing that dependency labels offer more discriminative
information that benefits the relation extraction task.

When we compare plain depLCNN and depLCNN+NS (without lexical features), we
can see that our Negative Sampling strategy brings an improvement of 2.1% in
F1. When both models are augmented with extra lexical features, our NS
strategy still gives an improvement of 1.9%. These comparisons further show
that our NS strategy can drive our model to learn proper assignments of
subjects/objects for a relation.

Next, we take a closer look at the effect of our Negative Sampling method. We
conduct additional experiments on the development set to compare two different
negative sampling methods. As a baseline, we randomly sampled 8,000 negative
examples from the NYT dataset (Chen et al., 2014). For our proposed NS, we
create a negative example from each non-other instance in the training set,
6,586 in total. As shown in Table 3, there is no doubt that introducing more
negative examples improves the performance. We can see that our model still
benefits from the randomly sampled negative examples, which may help our model
learn to refine the margin between the positive and negative examples.
However, with a similar amount of negative examples, treating the reversed
dependency paths from objects to subjects as negative examples achieves better
performance (85.4% F1), improving over random samples by 1.9%. This again
shows that dependency paths provide useful clues to reveal the assignments of
subjects and objects, and that a model can learn from such reversed paths as
negative examples to make correct assignments. Beyond the relation extraction
task, we believe the proposed Negative Sampling method has the potential to
benefit other NLP tasks, which we leave for future work.


6 Conclusion 

In this paper, we exploit a convolutional neural network to learn more robust
and effective relation representations from shortest dependency paths for
relation extraction. We further propose a simple negative sampling method to
help make correct assignments for subjects and objects within a relationship.
Experimental results show that our model significantly outperforms
state-of-the-art systems and that our treatment of dependency paths can
capture the syntactic features for relation extraction well.